feat: schema diffing #1346

murali-db · 2025-09-24T10:58:58Z

What changes are proposed in this pull request?

Implements a schema diffing framework for Delta Kernel Rust to enable schema evolution. The diff algorithm uses field IDs (from column mapping) to track fields across schema versions, enabling detection of renames, additions, removals, and modifications at all nesting levels.
Java version here.
Summary of the allowed and blocked changes is TODO.

Field ID-based matching: Identifies fields by column mapping ID, not name (enables rename detection)
Nested field support: Handles changes in structs, arrays, and maps at any depth
LCA filtering: Reports only least common ancestors to reduce noise (e.g., reports "user" added, not "user.name", "user.email", etc.)
Breaking change detection: Distinguishes safe changes (nullability loosened) from breaking ones (nullability tightened, type changed)
Container nullability tracking: Detects array/map nullability changes separately from element changes

How was this change tested?

Unit tests.

kernel/src/schema/diff.rs

codecov · 2025-10-01T11:11:29Z

Codecov Report

❌ Patch coverage is 98.19639% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.80%. Comparing base (87e3552) to head (f6226f5).

Files with missing lines	Patch %	Lines
kernel/src/schema/diff.rs	98.19%	23 Missing and 13 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1346      +/-   ##
==========================================
+ Coverage   84.96%   85.80%   +0.84%     
==========================================
  Files         114      115       +1     
  Lines       29445    31441    +1996     
  Branches    29445    31441    +1996     
==========================================
+ Hits        25017    26977    +1960     
- Misses       3220     3243      +23     
- Partials     1208     1221      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kernel/src/schema/diff.rs

OussamaSaoudi · 2025-10-10T21:53:20Z

kernel/src/schema/diff.rs

+    }
+
+    #[test]
+    fn test_cursed_deeply_nested_complex_types() {


cursed

LMAO

Please add cursed cases with double nesting. Ex: array<array<>> map<array<>, array<>>, and map<, map<,>>

with structs ^

OussamaSaoudi

Just some small cleanup. Looks correct to me. Thank you!

kernel/src/schema/diff.rs

Replace string-based field paths with ColumnName type throughout the schema diffing framework for better consistency with the codebase. Changes: - Updated FieldChange, FieldUpdate, and FieldWithPath to use ColumnName - Updated SchemaDiffError to use ColumnName in error messages - Refactored path construction to use ColumnName::new() and join() - Replaced string prefix matching with proper ColumnName comparison - Updated all test assertions to use ColumnName All tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Changed unwrap() to expect() with a descriptive message to satisfy clippy linter requirements while maintaining safety guarantees. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Add #![allow(dead_code)] to the schema diff module since this API is not yet consumed by other parts of the codebase. The functions are fully tested and ready for use in schema evolution features. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Combine duplicate test_ancestor_filtering tests into single test with two cases (add and remove) - Update test_ancestor_filtering_with_mixed_changes to verify both ancestor and descendant changes are included - Expand test_array_element_struct_field_changes to verify add, remove, and update operations - Expand test_map_value_struct_field_changes to verify add, remove, and update operations - Add test_cursed_deeply_nested_complex_types for frightening nested container scenarios 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Verify that array<array<int>> -> array<array<double>> is detected as TypeChanged on the parent field. This demonstrates the recursion limitation where we cannot descend into non-struct container types since there are no field IDs at those intermediate levels. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Renamed the function to better reflect that it classifies changes between DataType instances, making the purpose more explicit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

…maps This change improves the schema diff algorithm to correctly handle nested type changes in arrays and maps by using recursion in classify_data_type_change. Key improvements: - Recursively check element types in arrays (e.g., array<array<int>> -> array<array<double>>) - Recursively check key/value types in maps - Support combined changes (type + nullability) using FieldChangeType::Multiple - Only recurse when types actually differ to avoid false positives This fixes the issue where array<array<struct<a int>>> -> array<array<struct<a int, b int>>> would be incorrectly blocked. Now nested struct changes are properly detected via field IDs while also allowing detection of primitive type changes at any nesting level. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Created check_container_nullability_change() helper to reduce code duplication - Applied DRY principle for nullability checking logic used in arrays and maps - Added recursive type change detection for nested containers - Support combined changes (type + nullability) via FieldChangeType::Multiple - All 28 tests passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Clarified comments for map key/value type change detection - Existing code already handles recursion for arrays/maps containing structs - Example: map<array<struct<a int>>, int> -> map<array<struct<a int, b int>>, int> is correctly detected via recursive calls to classify_data_type_change - Map changes already combined via FieldChangeType::Multiple for: * Key type changes * Value type changes * Container nullability changes - All 28 tests passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Previously, field collection only handled one level of container nesting (e.g., array<struct<...>>). This refactoring extracts the container recursion logic into a new helper function that supports arbitrary nesting depth. The new `collect_fields_from_datatype` function recursively descends through container layers (arrays and maps) to find all nested struct fields, enabling support for deeply nested schemas like: - array<array<struct<...>>> - map<array<struct<...>>, array<struct<...>>> - map<array<array<struct<a int>>>, array<struct<b int>>> This uses mutual recursion between DataType processing and StructType processing to ensure all nested structs are discovered regardless of container nesting depth. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Metadata changes can be unsafe and breaking. For example, changes to row tracking metadata are not backward compatible. To remain on the safe side, we now classify all metadata changes as breaking changes. This updates the `is_breaking_change_type` function to include `FieldChangeType::MetadataChanged` in the set of breaking change types, and updates the test expectations accordingly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

When a diff is empty, it should never have breaking changes. This adds an assertion to verify this invariant in the test_identical_schemas test. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Enhanced test robustness by adding assertions for all three field categories (added, removed, updated) instead of just checking one. This ensures tests catch unexpected changes in any category. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

…ported Removed redundant assert_eq!(diff.added_fields.len(), 1) that was checking the same condition twice in the test. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Added assert!(diff.has_breaking_changes()) after verifying the nested field path to ensure the breaking changes flag remains correctly set throughout the test. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Verify that rename operations (which are non-breaking) correctly report no breaking changes in the array struct element test. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Verify that adding a non-nullable struct correctly reports as a breaking change in the ancestor filtering test with mixed changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Add comprehensive test cases for nested array nullability changes: - test_nested_array_nullability_loosened: outer array element nullability loosened - test_nested_array_nullability_tightened: outer array element nullability tightened - test_nested_array_inner_nullability_loosened: inner array element nullability loosened - test_nested_array_inner_nullability_tightened: inner array element nullability tightened These tests verify that nested array nullability changes are correctly detected and classified as breaking or non-breaking changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Added three complex test cases to test_cursed_deeply_nested_complex_types: - Case 3: array<array<struct>> - doubly nested array with struct elements - Case 4: map<array<struct>, array<struct>> - map with arrays of structs - Case 5: map<struct, map<struct, struct>> - map with nested map and structs These tests verify field changes are correctly detected through extremely complex nested structures with proper path construction and breaking change detection. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Updated schema diffing logic to classify field removals as safe operations rather than breaking changes. This aligns with the principle that dropping columns is safe - existing data files remain valid and queries referencing removed fields will fail at query time, but data integrity is maintained. Changes: - Modified compute_has_breaking_changes() to not flag field removals - Updated 5 test assertions that previously expected removals to be breaking - All 32 schema diff tests pass Rationale: Field removals should not break existing readers since they can safely ignore unknown fields in data files. Query-time failures for removed fields are acceptable as they reflect the intentional schema change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Co-authored-by: OussamaSaoudi <[email protected]>

Performance optimizations: - Add ColumnName::parent() method for O(1) parent path access - Remove ColumnName::is_descendant_of() (no longer needed) - Optimize has_added_ancestor() from O(N*M) to O(M) using parent chain - Add early return in classify_data_type_change() for identical types Code quality improvements: - Add check_field_nullability_change() helper for better code reuse - Refactor physical name validation into validate_physical_name() - Simplify nested match statements with helper functions Test coverage: - Add Case 6: Field nullability tightening in deeply nested structures - Add Case 7: Container nullability tightening in deeply nested structures All 32 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

zachschuermann · 2025-10-20T23:16:51Z

kernel/src/schema/diff.rs

+    /// The path to this field (e.g., ColumnName::new(["user", "address", "street"]))
+    pub path: ColumnName,
+    /// The type of change that occurred
+    pub change_type: FieldChangeType,


could we make this changes: Vec<FieldChangeType> and avoid the extra indirection through the FieldChangeType::Multiple variant?

github-actions bot assigned murali-db Sep 24, 2025

murali-db force-pushed the murali-db/schema-evol branch from 654aa32 to 3af7c20 Compare September 24, 2025 11:00

github-actions bot added the breaking-change Change that require a major version bump label Sep 24, 2025

murali-db requested review from OussamaSaoudi and zachschuermann September 24, 2025 11:45

OussamaSaoudi reviewed Sep 29, 2025

View reviewed changes

kernel/src/schema/diff.rs Show resolved Hide resolved

murali-db force-pushed the murali-db/schema-evol branch from 108283f to ed148a5 Compare October 1, 2025 11:18

murali-db requested a review from OussamaSaoudi October 1, 2025 11:49

murali-db changed the title ~~feat: schema diffing poc~~ feat: schema diffing Oct 1, 2025

OussamaSaoudi reviewed Oct 7, 2025

View reviewed changes

kernel/src/schema/diff.rs Show resolved Hide resolved

OussamaSaoudi reviewed Oct 7, 2025

View reviewed changes

murali-db requested a review from OussamaSaoudi October 9, 2025 15:39